Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

learning model can be constructed based on the feature space as below, where $\Omega_k$ is a subspace to be generated using a cluster model or a classification analysis model $f(\mathbf{X})$,

$$f(\mathbf{X}) \Longrightarrow \bigcup \Omega_k \qquad (7.15)$$

It is understood that different species will retain different sequence statistics in their genomes. A nucleotide sequence or a protein sequence is assumed to be a string formed by genetic recombination of the nucleic acids or the amino acids. Therefore they will form different distributions of sequence statistics. In natural language processing, a library of basic words constitutes the basis for the classification of articles or books. A nucleotide sequence or a protein sequence of a species can be treated as an article or a book. It should contain some basic sequence statistics pattern of how short sub-sequences of the nucleic acids or the amino acids are distributed within a genome or a whole sequence. Therefore, extracting sequence statistics to generate the frequency or library of sub-sequences from a sequence can be used to compare or classify sequences, and thus species.

Sequence statistics mainly represent the distribution of words, i.e., the consecutive sub-strings within a sequence string. These sub-strings or words are called k-mers in sequence analysis. For instance, any nucleotide sequence is a chain or a string of nucleic acids, i.e., $\mathbf{s}_n \in \{A, C, G, T\}^{\ell_n}$, where $\ell_n$ is the length of the sequence. In case k = 1 when using the k-mer approach to represent sequences, $\mathbf{x}_n \in \mathcal{R}^{4}$. There are 16 2-mers, such as AA, AC, AG, AT, CA, etc. Therefore, $\mathbf{s}_n \Longrightarrow \mathbf{x}_n \in \mathcal{R}^{16}$ if 2-mers are used to represent nucleotide sequences. There are 64 3-mers, such as AAA, AAC, etc., in a nucleotide sequence. Therefore, $\mathbf{s}_n \Longrightarrow \mathbf{x}_n \in \mathcal{R}^{64}$ if 3-mers are used to represent nucleotide sequences.
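The mapping from a sequence string to a fixed-length k-mer count vector can be sketched as follows. This is a minimal Python illustration rather than code from the book. The function name `kmer_vector`, the lexicographic ordering of the words, and the toy sequence are all assumptions made for the example. Windows are taken at every position (overlapping), which is one common convention.

```python
from itertools import product

ALPHABET = "ACGT"  # nucleotide alphabet used in the text

def kmer_vector(sequence, k):
    """Return (words, counts) for `sequence`.

    `words` lists all 4**k possible k-mers in lexicographic order;
    `counts[i]` is the number of times `words[i]` occurs in `sequence`.
    The name and the ordering are illustrative assumptions.
    """
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    counts = [0] * len(words)
    # Slide a window of width k over the sequence, one position at a time.
    for start in range(len(sequence) - k + 1):
        word = sequence[start:start + k]
        if word in index:  # skip windows containing symbols outside ALPHABET
            counts[index[word]] += 1
    return words, counts

# Toy sequence invented for illustration.
words, x = kmer_vector("ACGTACGA", 2)
print(len(x))                     # 16 entries, one per 2-mer
print(dict(zip(words, x))["AC"])  # 2 occurrences of "AC"
```

With k = 1, 2 and 3 the returned vector has 4, 16 and 64 entries respectively, matching the dimensions given above. Raw counts are used here; dividing by the number of windows would give frequencies instead.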

The collection of all words for a specific size k is called a word set. Counting the times each word occurs in a sequence thus builds up a sequence statistics library or the word library. Similar sequences are expected to have similar sequence statistics patterns.
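The idea that similar sequences have similar word libraries can be illustrated with a small Python sketch. The text does not prescribe a similarity measure; cosine similarity is used here purely as an assumed, convenient choice. The function names and the three toy sequences are invented for the example.

```python
from collections import Counter
from math import sqrt

def word_library(sequence, k):
    """Count how many times each k-letter word occurs in `sequence`."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def cosine_similarity(lib_a, lib_b):
    """Cosine of the angle between two word-count vectors.

    A value near 1.0 means the words occur in similar proportions.
    This measure is an assumption of the sketch, not taken from the text.
    """
    shared = lib_a.keys() & lib_b.keys()
    dot = sum(lib_a[w] * lib_b[w] for w in shared)
    norm_a = sqrt(sum(c * c for c in lib_a.values()))
    norm_b = sqrt(sum(c * c for c in lib_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy sequences invented for illustration only.
seq_a = "ACGTACGTACGT"
seq_b = "ACGTACGTACGA"  # differs from seq_a in the last base only
seq_c = "TTTTGGGGTTTT"  # very different composition

lib_a, lib_b, lib_c = (word_library(s, 2) for s in (seq_a, seq_b, seq_c))

# The near-identical pair scores higher than the dissimilar pair.
print(cosine_similarity(lib_a, lib_b) > cosine_similarity(lib_a, lib_c))  # True
```

Any distance or similarity defined on count vectors could be substituted for the cosine measure; the point is only that the comparison is made on the word libraries rather than on the raw strings.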